{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Hypothesis Testing\n",
    "\n",
    "\n",
    "We would like to know if the effects we see in the sample(observed data) are likely to occur in the population. \n",
    "\n",
    "The way classical hypothesis testing works is by conducting a statistical test to answer the following question:\n",
    "> Given the sample and an effect, what is the probability of seeing that effect just by chance?\n",
    "\n",
    "Here are the steps on how we would do this\n",
    "\n",
    "1. Compute test statistic\n",
    "2. Define null hypothesis\n",
    "3. Compute p-value\n",
    "4. Interpret the result\n",
    "\n",
    "If p-value is very low(most often than now, below 0.05), the effect is considered statistically significant. That means that effect is unlikely to have occured by chance. The inference? The effect is likely to be seen in the population too. \n",
    "\n",
    "This process is very similar to the *proof by contradiction* paradigm. We first assume that the effect is false. That's the null hypothesis. Next step is to compute the probability of obtaining that effect (the p-value). If p-value is very low(<0.05 as a rule of thumb), we reject the null hypothesis. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from scipy import stats\n",
    "import matplotlib as mpl\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "sns.set(color_codes=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "weed_pd = pd.read_csv(\"../data/Weed_Price.csv\", parse_dates=[-1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "weed_pd[\"month\"] = weed_pd.date.apply(lambda x: x.month)\n",
    "weed_pd[\"year\"] = weed_pd.date.apply(lambda x: x.year)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>State</th>\n",
       "      <th>HighQ</th>\n",
       "      <th>HighQN</th>\n",
       "      <th>MedQ</th>\n",
       "      <th>MedQN</th>\n",
       "      <th>LowQ</th>\n",
       "      <th>LowQN</th>\n",
       "      <th>date</th>\n",
       "      <th>month</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Alabama</td>\n",
       "      <td>339.06</td>\n",
       "      <td>1042</td>\n",
       "      <td>198.64</td>\n",
       "      <td>933</td>\n",
       "      <td>149.49</td>\n",
       "      <td>123</td>\n",
       "      <td>2014-01-01</td>\n",
       "      <td>1</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Alaska</td>\n",
       "      <td>288.75</td>\n",
       "      <td>252</td>\n",
       "      <td>260.60</td>\n",
       "      <td>297</td>\n",
       "      <td>388.58</td>\n",
       "      <td>26</td>\n",
       "      <td>2014-01-01</td>\n",
       "      <td>1</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Arizona</td>\n",
       "      <td>303.31</td>\n",
       "      <td>1941</td>\n",
       "      <td>209.35</td>\n",
       "      <td>1625</td>\n",
       "      <td>189.45</td>\n",
       "      <td>222</td>\n",
       "      <td>2014-01-01</td>\n",
       "      <td>1</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Arkansas</td>\n",
       "      <td>361.85</td>\n",
       "      <td>576</td>\n",
       "      <td>185.62</td>\n",
       "      <td>544</td>\n",
       "      <td>125.87</td>\n",
       "      <td>112</td>\n",
       "      <td>2014-01-01</td>\n",
       "      <td>1</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>California</td>\n",
       "      <td>248.78</td>\n",
       "      <td>12096</td>\n",
       "      <td>193.56</td>\n",
       "      <td>12812</td>\n",
       "      <td>192.92</td>\n",
       "      <td>778</td>\n",
       "      <td>2014-01-01</td>\n",
       "      <td>1</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        State   HighQ  HighQN    MedQ  MedQN    LowQ  LowQN       date  month  \\\n",
       "0     Alabama  339.06    1042  198.64    933  149.49    123 2014-01-01      1   \n",
       "1      Alaska  288.75     252  260.60    297  388.58     26 2014-01-01      1   \n",
       "2     Arizona  303.31    1941  209.35   1625  189.45    222 2014-01-01      1   \n",
       "3    Arkansas  361.85     576  185.62    544  125.87    112 2014-01-01      1   \n",
       "4  California  248.78   12096  193.56  12812  192.92    778 2014-01-01      1   \n",
       "\n",
       "   year  \n",
       "0  2014  \n",
       "1  2014  \n",
       "2  2014  \n",
       "3  2014  \n",
       "4  2014  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "weed_pd.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Let's work on weed prices in California in 2014\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "weed_ca_2014 = weed_pd[(weed_pd.State==\"California\") & (weed_pd.year==2014)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean: 245.894230769\n",
      "Standard Deviation: 1.28990793937\n"
     ]
    }
   ],
   "source": [
    "#Mean and standard deviation of high quality weed's price\n",
    "print \"Mean:\", weed_ca_2014.HighQ.mean()\n",
    "print \"Standard Deviation:\", weed_ca_2014.HighQ.std()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(245.761718492726, 246.02674304573577)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Confidence interval on the mean\n",
    "stats.norm.interval(0.95, loc=weed_ca_2014.HighQ.mean(), scale = weed_ca_2014.HighQ.std()/np.sqrt(len(weed_ca_2014)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question: Are high-quality weed prices in Jan 2014 significantly higher than in Jan 2015?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "#Get the data\n",
    "weed_ca_jan2014 = np.array(weed_pd[(weed_pd.State==\"California\") & (weed_pd.year==2014) & (weed_pd.month==1)].HighQ)\n",
    "weed_ca_jan2015 = np.array(weed_pd[(weed_pd.State==\"California\") & (weed_pd.year==2015) & (weed_pd.month==1)].HighQ)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Mean-2014 Jan: 248.445483871\n",
      "Mean-2015 Jan: 243.602258065\n"
     ]
    }
   ],
   "source": [
    "print \"Mean-2014 Jan:\", weed_ca_jan2014.mean()\n",
    "print \"Mean-2015 Jan:\", weed_ca_jan2015.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Effect size: 4.84322580645\n"
     ]
    }
   ],
   "source": [
    "print \"Effect size:\", weed_ca_jan2014.mean() - weed_ca_jan2015.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Null Hypothesis**: Mean prices aren't significantly different\n",
    "\n",
    "Perform **t-test** and determine the p-value. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Ttest_indResult(statistic=98.011325238158051, pvalue=6.2979718185084028e-68)"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "stats.ttest_ind(weed_ca_jan2014, weed_ca_jan2015, equal_var=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "p-value is the probability that the effective size was by chance. And here, p-value is almost 0.\n",
    "\n",
    "*Conclusion*: The price difference is significant. But is a price increase of $4.85 a big deal? The price decreased in 2015 by almost 2%. Always remember to look at effect size. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Problem** Determine if prices of medium quality weed for Jan 2015 and Feb 2015 are significantly different for New York. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Assumption of t-test\n",
    "\n",
    "One assumption is that the data used came from a normal distribution. \n",
    "<br>\n",
    "There's a [Shapiro-Wilk test](https://en.wikipedia.org/wiki/Shapiro-Wilk) to test for normality. If p-value is less than 0.05, then there's a low chance that the distribution is normal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.9469053149223328, 0.12818680703639984)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "stats.shapiro(weed_ca_jan2015)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(0.9353488683700562, 0.06141229346394539)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "stats.shapiro(weed_ca_jan2014)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#We seem to be good."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A/B testing\n",
    "\n",
    "Comparing two versions to check which one performs better. Eg: Show to people two variants for the same webpage that they want to see and find which one provides better conversion rate (or the relevant metric). [wiki](https://en.wikipedia.org/wiki/A/B_testing)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise: Impact of regulation and deregulation.**\n",
    "\n",
    "Information on regulation of Weed in the US by State [wiki](Impact of regulation and deregulation on a couple of states )\n",
    "\n",
    "1. Alaska legalized it on 4th Nov 2014. Find if prices significantly changed in Dec 2014 compared to Oct 2014. \n",
    "2. Maryland decriminalized possessing weed from Oct 1, 2014. Find if prices of weed changed significantly in Oct 2014 compared to Sep 2014"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2> Something to think about: Which of these give smaller p-values ? </h2>\n",
    "   \n",
    "   * Smaller effect size\n",
    "   * Smaller standard error\n",
    "   * Smaller sample size\n",
    "   * Higher variance\n",
    "   \n",
    "   **Answer:** "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Chi-square tests"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Chi-Square tests are used when the data are frequencies, rather than numerical score/price.\n",
    "\n",
    "The following two tests make use of chi-square statistic\n",
    "\n",
    "1. chi-square test for goodness of fit\n",
    "2. chi-square test for independence\n",
    "\n",
    "Chi-square test is a non-parametric test. They do not require assumptions about population parameters and they do not test hypotheses about population parameters."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2> Chi-Square test for goodness fit </h2>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$ \\chi^2 = \\sum (O - E)^2/E $$\n",
    "\n",
    "* O is observed frequency\n",
    "* E is expected frequency\n",
    "* $ \\chi $ is the chi-square statistic"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's assume the proportion of people who bought High, Medium and Low quality weed in Jan-2014 as the expected proportion. Find if proportion of people who bought weed in Jan 2015 conformed to the norm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "weed_jan2014 = weed_pd[(weed_pd.year==2014) & (weed_pd.month==1)][[\"HighQN\", \"MedQN\", \"LowQN\"]]\n",
    "weed_jan2015 = weed_pd[(weed_pd.year==2015) & (weed_pd.month==1)][[\"HighQN\", \"MedQN\", \"LowQN\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "Expected = np.array(weed_jan2014.apply(sum, axis=0))\n",
    "Observed = np.array(weed_jan2015.apply(sum, axis=0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Expected: [2918004 2644757  263958] \n",
      "Observed: [4057716 4035049  358088]\n"
     ]
    }
   ],
   "source": [
    "print \"Expected:\", Expected, \"\\n\" , \"Observed:\", Observed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Expected: [ 0.5007971   0.45390159  0.04530131] \n",
      "Observed: [ 0.48015461  0.47747239  0.042373  ]\n"
     ]
    }
   ],
   "source": [
    "print \"Expected:\", Expected/np.sum(Expected.astype(float)), \"\\n\" , \"Observed:\", Observed/np.sum(Observed.astype(float))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Power_divergenceResult(statistic=1209562.2775169075, pvalue=0.0)"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "stats.chisquare(Observed, Expected)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Inference* : We reject null hypothesis. The proportions in Jan 2015 is different than what was expected."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
